Kickstarter is an American public-benefit corporation based in Brooklyn, New York, that maintains a global crowd funding platform focused on creativity. The company’s stated mission is to “help bring creative projects to life”.
Kickstarter has reportedly received almost $6 billion in pledges from 20 million backers to fund more than 200,000 creative projects, such as films, music, stage shows, comics, journalism, video games, technology and food-related projects.
For this assignment, I am asking you to analyze the descriptions of kickstarter projects to identify commonalities of successful (and unsuccessful projects) using the text mining techniques we covered in the past two lectures.
The dataset for this assignment is taken from webroboto.io ‘s repository. They developed a scrapper robot that crawls all Kickstarter projects monthly since 2009. I noticed that the most recent crawls appear to be incomplete, so we will take data from the last complete crawl on 2021-05-17.
To simplify your task, I have downloaded the files and partially cleaned the scraped data. In particular, I converted several JSON columns, corrected some obvious data issues, and removed some variables that are not of interest (or missing frequently), and removed some duplicated project entries. I have also subsetted the data to only contain projects with locations set to the United States (to have only English language and USD denominated projects). Some data issues surely remain, so please adjust as you find it necessary to complete the analysis.
The data is contained in the file kickstarter_projects_2021_05.csv and contains about 131k projects and about 20 variables.
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidymodels)
## Registered S3 method overwritten by 'tune':
## method from
## required_pkgs.model_spec parsnip
## ── Attaching packages ────────────────────────────────────── tidymodels 0.1.4 ──
## ✓ broom 0.7.9 ✓ rsample 0.1.1
## ✓ dials 0.0.10 ✓ tibble 3.1.6
## ✓ ggplot2 3.3.5 ✓ tidyr 1.2.0
## ✓ infer 1.0.0 ✓ tune 0.1.6
## ✓ modeldata 0.1.1 ✓ workflows 0.2.4
## ✓ parsnip 0.1.7 ✓ workflowsets 0.1.0
## ✓ purrr 0.3.4 ✓ yardstick 0.0.8
## ✓ recipes 0.1.17
## Warning: package 'tidyr' was built under R version 4.1.2
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x purrr::discard() masks scales::discard()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step() masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ stringr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x scales::col_factor() masks readr::col_factor()
## x purrr::discard() masks scales::discard()
## x dplyr::filter() masks stats::filter()
## x stringr::fixed() masks recipes::fixed()
## x dplyr::lag() masks stats::lag()
## x yardstick::spec() masks readr::spec()
## load dataset
setwd('/Users/nikkigerjarusak/Documents/GitHub/assignment-3---kickstarter-nikkigsak/')
df <- read_csv("kickstarter_projects_2021-05.csv")
## Rows: 130768 Columns: 25
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (12): blurb, country, country_displayable_name, currency, name, slug, s...
## dbl (6): backers_count, converted_pledged_amount, goal, id, pledged, usd_e...
## lgl (3): is_starrable, spotlight, staff_pick
## date (4): created_at, deadline, launched_at, state_changed_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
There are several ways to identify success of a project:
- State (state): Whether a campaign was successful or not.
- Pledged Amount (pledged)
- Achievement Ratio: The variable achievement_ratio is calculating the percentage of the original monetary goal reached by the actual amount pledged (that is pledged\goal *100).
- Number of backers (backers_count)
- How quickly the goal was reached (difference between launched_at and state_changed_at) for those campaigns that were successful.
Use two of these measures to visually summarize which categories were most successful in attracting funding on kickstarter. Briefly summarize your findings.
cat_1 <- df %>%
group_by(top_category) %>%
summarize(total_pledged = sum(pledged))
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
success1 <- ggplot(cat_1, aes(x = top_category, y = total_pledged, fill = top_category)) +
geom_bar(position = "dodge", stat = "identity") + theme(axis.text.x = element_text(angle = 45)) +
labs (x = "Categories", y = "Amount Pledged",
title = "Amount Pledged Per Category")
ggplotly(success1)
In the plot above, we can see that the categories that get the largest pledged amount are by far technology, followed by film & video and games.
cat_2 <- df %>%
group_by(top_category) %>%
summarize(total_backers = sum(backers_count))
success2 <- ggplot(cat_2, aes(x = top_category, y = total_backers, fill = top_category)) +
geom_bar(position = "dodge", stat = "identity") + theme(axis.text.x = element_text(angle = 45)) +
labs (x = "Categories", y = "Number of Backers",
title = "Backers Per Category")
ggplotly(success2)
In measuring success by the number of backers, technology is also the category with the most. However, the discrepancy between technology and the following two categories games and film & video is not as large as when measuring success by amount pledged.
Now, use the location information to calculate the total number of successful projects by state (if you are ambitious, normalize by population). Also, identify the Top 50 “innovative” cities in the U.S. (by whatever measure you find plausible). Provide a leaflet map showing the most innovative states and cities in the U.S. on a single map based on these information.
locat_1 <- df %>%
group_by(location_state) %>%
summarize(total_pledged = sum(pledged)) %>%
arrange(desc(total_pledged))
top_cities <- df %>%
group_by(location_town) %>%
summarize(total_pledged = sum(pledged)) %>%
arrange(desc(total_pledged)) %>%
slice(1:50)
Each project contains a blurb – a short description of the project. While not the full description of the project, the short headline is arguably important for inducing interest in the project (and ultimately popularity and success). Let’s analyze the text.
To reduce the time for analysis, select the 1000 most successful projects and a sample of 1000 unsuccessful projects (by a metric of your choice). Use the cleaning functions introduced in lecture (or write your own in addition) to remove unnecessary words (stop words), syntax, punctuation, numbers, white space etc. Note, that many projects use their own unique brand names in upper cases, so try to remove these fully capitalized words as well (since we are aiming to identify common words across descriptions). Create a document-term-matrix.
## df$pledged <- as.integer(df$pledged)
##drop_na(df, pledged)
## select top 1000 by amount pledged
top_1000 <- df %>%
filter(state=='successful') %>%
arrange(desc(pledged)) %>%
ungroup() %>%
slice(1:1000)
## select bottom 1000 by amount pledged
bottom_1000 <- df %>%
filter(state=='failed') %>%
arrange(pledged) %>%
ungroup() %>%
slice(1:1000)
## merge to create df
## data <- rbind(top_1000, bottom_1000)
## top 1000 successful projects
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
top_text <- top_1000$blurb
corp <- Corpus(VectorSource(top_text))
## clean text by removing characters
text_fun <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
corp <- tm_map(corp, text_fun, "/")
## Warning in tm_map.SimpleCorpus(corp, text_fun, "/"): transformation drops
## documents
corp <- tm_map(corp, text_fun, "@")
## Warning in tm_map.SimpleCorpus(corp, text_fun, "@"): transformation drops
## documents
corp <- tm_map(corp, text_fun, "\\|")
## Warning in tm_map.SimpleCorpus(corp, text_fun, "\\|"): transformation drops
## documents
## lowercase
corp <- tm_map(corp, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(corp, content_transformer(tolower)):
## transformation drops documents
## remove numbers
corp <- tm_map(corp, removeNumbers)
## Warning in tm_map.SimpleCorpus(corp, removeNumbers): transformation drops
## documents
## remove stopwords
corp <- tm_map(corp, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(corp, removeWords, stopwords("english")):
## transformation drops documents
## remove punctuation
corp <- tm_map(corp, removePunctuation)
## Warning in tm_map.SimpleCorpus(corp, removePunctuation): transformation drops
## documents
## remove extra white spaces
corp <- tm_map(corp, stripWhitespace)
## Warning in tm_map.SimpleCorpus(corp, stripWhitespace): transformation drops
## documents
## stemming
corp <- tm_map(corp, stemDocument)
## Warning in tm_map.SimpleCorpus(corp, stemDocument): transformation drops
## documents
## lemmatize
library(textstem)
## Loading required package: koRpus.lang.en
## Loading required package: koRpus
## Loading required package: sylly
## For information on available language packages for 'koRpus', run
##
## available.koRpus.lang()
##
## and see ?install.koRpus.lang()
##
## Attaching package: 'koRpus'
## The following object is masked from 'package:tm':
##
## readTagged
## The following object is masked from 'package:readr':
##
## tokenize
clean_corp <- lemmatize_words(corp)
library(tidytext)
## successful projects document-term-matrix
## build matrix
top_dtm <- DocumentTermMatrix(clean_corp)
top_td <- tidy(top_dtm)
top_lemma <- top_td %>%
mutate(lemma = lemmatize_words(term))
top_lemma <- top_lemma %>%
group_by(lemma) %>%
summarize(count = sum(count))
## bottom 1000 failed projects
bottom_text <- bottom_1000$blurb
corp_1 <- Corpus(VectorSource(bottom_text))
## clean text by removing characters
text_fun <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
corp_1 <- tm_map(corp_1, text_fun, "/")
## Warning in tm_map.SimpleCorpus(corp_1, text_fun, "/"): transformation drops
## documents
corp_1 <- tm_map(corp_1, text_fun, "@")
## Warning in tm_map.SimpleCorpus(corp_1, text_fun, "@"): transformation drops
## documents
corp_1 <- tm_map(corp_1, text_fun, "\\|")
## Warning in tm_map.SimpleCorpus(corp_1, text_fun, "\\|"): transformation drops
## documents
## lowercase
corp_1 <- tm_map(corp_1, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(corp_1, content_transformer(tolower)):
## transformation drops documents
## remove numbers
corp_1 <- tm_map(corp_1, removeNumbers)
## Warning in tm_map.SimpleCorpus(corp_1, removeNumbers): transformation drops
## documents
## remove stopwords
corp_1 <- tm_map(corp_1, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(corp_1, removeWords, stopwords("english")):
## transformation drops documents
## remove punctuation
corp_1 <- tm_map(corp_1, removePunctuation)
## Warning in tm_map.SimpleCorpus(corp_1, removePunctuation): transformation drops
## documents
## remove extra white spaces
corp_1 <- tm_map(corp_1, stripWhitespace)
## Warning in tm_map.SimpleCorpus(corp_1, stripWhitespace): transformation drops
## documents
## stemming
corp_1 <- tm_map(corp_1, stemDocument)
## Warning in tm_map.SimpleCorpus(corp_1, stemDocument): transformation drops
## documents
## lemmatize
clean_corp1 <- lemmatize_words(corp_1)
## failed projects document-term-matrix
## build matrix
bottom_dtm <- DocumentTermMatrix(clean_corp1)
bottom_td <- tidy(bottom_dtm)
bottom_lemma <- bottom_td %>%
mutate(lemma = lemmatize_words(term))
bottom_lemma <- bottom_lemma %>%
group_by(lemma) %>%
summarize(count = sum(count))
Provide a word cloud of the most frequent or important words (your choice which frequency measure you choose) among the most successful projects.
library(wordcloud)
## Loading required package: RColorBrewer
set.seed(1234)
wordcloud(words = top_lemma$lemma, freq = top_lemma$count, min.freq = 7,
max.words=100, random.order=FALSE,
colors=brewer.pal(9, "Dark2"))
## Warning in brewer.pal(9, "Dark2"): n too large, allowed maximum for palette Dark2 is 8
## Returning the palette you asked for with that many colors
## Warning in wordcloud(words = top_lemma$lemma, freq = top_lemma$count, min.freq =
## 7, : screen could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = top_lemma$lemma, freq = top_lemma$count, min.freq =
## 7, : compact could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = top_lemma$lemma, freq = top_lemma$count, min.freq =
## 7, : qualiti could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = top_lemma$lemma, freq = top_lemma$count, min.freq =
## 7, : track could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = top_lemma$lemma, freq = top_lemma$count, min.freq =
## 7, : travel could not be fit on page. It will not be plotted.
Provide a pyramid plot to show how the words between successful and unsuccessful projects differ in frequency. A selection of 10 - 20 top words is sufficient here.
top_tdm1 <- TermDocumentMatrix(clean_corp)
top_mat <- as.matrix(top_tdm1)
top_tdm_v <- sort(rowSums(top_mat), decreasing = TRUE)
top_tdm_d <- data.frame(word = names(top_tdm_v), freq = top_tdm_v)
bottom_tdm1 <-TermDocumentMatrix(clean_corp1)
bottom_mat <- as.matrix(bottom_tdm1)
bottom_tdm_v <- sort(rowSums(bottom_mat), decreasing = TRUE)
bottom_tdm_d <- data.frame(word = names(bottom_tdm_v), freq = bottom_tdm_v)
## create merged data frame
data = merge(x = top_tdm_d, y = bottom_tdm_d, by = 'word')
colnames(data)[colnames(data) %in% c("freq.x", "freq.y")] <- c("successful", "unsuccessful")
colnames(data)
## [1] "word" "successful" "unsuccessful"
data$final <- data$successful + data$unsuccessful
## top 20
data_pyramid<-data %>%
arrange(desc(final)) %>%
mutate(rank=row_number()) %>%
filter(rank<=20)
## pyramid plot
library(plotrix)
##
## Attaching package: 'plotrix'
## The following object is masked from 'package:scales':
##
## rescale
pyr_plot <- pyramid.plot(data_pyramid$successful, data_pyramid$unsuccessful,
labels = data_pyramid$word,
gap = 10,
top.labels = c("Successful Words", " ", "Unsuccessful Words"),
main = "Top 20 Words in Successful and Unsuccessful projects",
laxlab = NULL,
raxlab = NULL,
unit = NULL,
labelcex=0.8)
## 126 126
These blurbs are short in length (max. 150 characters) but let’s see whether brevity and simplicity still matters. Calculate a readability measure (Flesh Reading Ease, Flesh Kincaid or any other comparable measure) for the texts. Visualize the relationship between the readability measure and one of the measures of success. Briefly comment on your finding.
require(quanteda)
## Loading required package: quanteda
## Warning: package 'quanteda' was built under R version 4.1.2
## Package version: 3.2.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following objects are masked from 'package:koRpus':
##
## tokens, types
## The following object is masked from 'package:tm':
##
## stopwords
## The following objects are masked from 'package:NLP':
##
## meta, meta<-
require(dplyr)
require(quanteda.textstats)
## Loading required package: quanteda.textstats
## successful projects
top_corpus <- corpus(top_text)
FRE_top <- textstat_readability(top_corpus,
measure=c('Flesch.Kincaid'))
FRE_top
## document Flesch.Kincaid
## 1 text1 16.7223529
## 2 text2 8.3700000
## 3 text3 9.5500000
## 4 text4 8.3700000
## 5 text5 12.8285714
## 6 text6 10.4866667
## 7 text7 13.8600000
## 8 text8 6.0100000
## 9 text9 7.0478571
## 10 text10 5.8254348
## 11 text11 2.8955000
## 12 text12 12.8285714
## 13 text13 11.6800000
## 14 text14 12.3147368
## 15 text15 13.2517647
## 16 text16 8.1716667
## 17 text17 7.2095238
## 18 text18 8.5415385
## 19 text19 12.8285714
## 20 text20 8.7296154
## 21 text21 10.5809524
## 22 text22 6.7900000
## 23 text23 3.2602174
## 24 text24 10.3104348
## 25 text25 13.4500000
## 26 text26 6.2755556
## 27 text27 12.5576471
## 28 text28 7.3676316
## 29 text29 4.8610000
## 30 text30 5.8640909
## 31 text31 11.9616667
## 32 text32 12.9450000
## 33 text33 2.4700000
## 34 text34 4.8400000
## 35 text35 10.5809524
## 36 text36 7.6125000
## 37 text37 6.9363636
## 38 text38 6.6000000
## 39 text39 13.9523810
## 40 text40 13.5568421
## 41 text41 6.9368182
## 42 text42 8.0090909
## 43 text43 14.5142857
## 44 text44 13.5125000
## 45 text45 1.8200000
## 46 text46 9.6550000
## 47 text47 10.5000000
## 48 text48 8.3500000
## 49 text49 8.7800000
## 50 text50 14.9875000
## 51 text51 7.1900000
## 52 text52 7.3645652
## 53 text53 10.9350000
## 54 text54 2.7792593
## 55 text55 7.1602941
## 56 text56 4.7909091
## 57 text57 12.9357895
## 58 text58 9.3616667
## 59 text59 15.0761905
## 60 text60 1.8983333
## 61 text61 4.7913636
## 62 text62 13.9523810
## 63 text63 16.9900000
## 64 text64 4.8300000
## 65 text65 6.7800000
## 66 text66 7.1900000
## 67 text67 12.6050000
## 68 text68 8.1716667
## 69 text69 13.4500000
## 70 text70 10.7233333
## 71 text71 5.3646154
## 72 text72 12.3626087
## 73 text73 7.1400000
## 74 text74 3.8260000
## 75 text75 7.4727273
## 76 text76 10.7233333
## 77 text77 11.3000000
## 78 text78 11.3365217
## 79 text79 10.5000000
## 80 text80 5.3150000
## 81 text81 13.2955556
## 82 text82 3.6500000
## 83 text83 14.7989474
## 84 text84 13.4866667
## 85 text85 13.8600000
## 86 text86 11.5000000
## 87 text87 12.5576471
## 88 text88 4.0903704
## 89 text89 7.6518182
## 90 text90 6.1326316
## 91 text91 7.9957895
## 92 text92 13.4366667
## 93 text93 5.8066667
## 94 text94 10.4411111
## 95 text95 10.7233333
## 96 text96 6.3384783
## 97 text97 5.3621429
## 98 text98 6.4000000
## 99 text99 7.3645652
## 100 text100 12.8600000
## 101 text101 7.9886842
## 102 text102 11.6800000
## 103 text103 4.3890000
## 104 text104 11.8495652
## 105 text105 8.4128571
## 106 text106 15.4200000
## 107 text107 15.2200000
## 108 text108 7.6097619
## 109 text109 7.4727273
## 110 text110 6.2983333
## 111 text111 4.8434783
## 112 text112 12.3626087
## 113 text113 12.5083333
## 114 text114 9.8589474
## 115 text115 13.3904762
## 116 text116 11.6936842
## 117 text117 13.0733333
## 118 text118 13.2517647
## 119 text119 8.1800000
## 120 text120 13.5120000
## 121 text121 11.0900000
## 122 text122 8.5454545
## 123 text123 11.8675000
## 124 text124 13.4500000
## 125 text125 8.6097368
## 126 text126 7.6338462
## 127 text127 14.6300000
## 128 text128 9.6550000
## 129 text129 12.8354545
## 130 text130 8.7335714
## 131 text131 9.2843478
## 132 text132 13.5568421
## 133 text133 14.9875000
## 134 text134 11.0726316
## 135 text135 11.2272727
## 136 text136 14.6300000
## 137 text137 3.8972727
## 138 text138 13.7188889
## 139 text139 10.0985714
## 140 text140 7.6097619
## 141 text141 11.7842857
## 142 text142 7.7800000
## 143 text143 13.5568421
## 144 text144 5.0779412
## 145 text145 11.8495652
## 146 text146 12.0375000
## 147 text147 6.4859524
## 148 text148 9.1300000
## 149 text149 10.0190476
## 150 text150 13.5568421
## 151 text151 4.3316667
## 152 text152 7.6097619
## 153 text153 9.8573810
## 154 text154 10.1483333
## 155 text155 9.9666667
## 156 text156 7.9886842
## 157 text157 8.9600000
## 158 text158 2.3422222
## 159 text159 11.5000000
## 160 text160 9.8250000
## 161 text161 6.7607895
## 162 text162 7.1602941
## 163 text163 7.4086957
## 164 text164 11.4700000
## 165 text165 12.2666667
## 166 text166 9.2307895
## 167 text167 5.3330000
## 168 text168 13.4366667
## 169 text169 14.4147826
## 170 text170 11.5000000
## 171 text171 8.3533333
## 172 text172 7.6988235
## 173 text173 15.2200000
## 174 text174 9.8305263
## 175 text175 17.0300000
## 176 text176 13.0633333
## 177 text177 4.8233333
## 178 text178 17.9375000
## 179 text179 12.8600000
## 180 text180 12.8585185
## 181 text181 3.7550000
## 182 text182 8.3333333
## 183 text183 10.4192857
## 184 text184 15.3341176
## 185 text185 6.4661765
## 186 text186 11.2646154
## 187 text187 12.2990909
## 188 text188 12.2700000
## 189 text189 5.8522222
## 190 text190 12.8354545
## 191 text191 15.4500000
## 192 text192 12.6900000
## 193 text193 5.8453704
## 194 text194 15.4500000
## 195 text195 9.2307895
## 196 text196 8.3533333
## 197 text197 1.6073529
## 198 text198 7.0478571
## 199 text199 6.9311111
## 200 text200 8.5750000
## 201 text201 10.9558696
## 202 text202 10.4866667
## 203 text203 7.4718182
## 204 text204 8.5415385
## 205 text205 7.7800000
## 206 text206 9.0542857
## 207 text207 8.3929412
## 208 text208 7.9886842
## 209 text209 9.4571429
## 210 text210 14.7989474
## 211 text211 6.8515217
## 212 text212 9.2307895
## 213 text213 5.8066667
## 214 text214 11.7047619
## 215 text215 10.4866667
## 216 text216 6.1326316
## 217 text217 13.5568421
## 218 text218 7.3685714
## 219 text219 10.5809524
## 220 text220 13.3718182
## 221 text221 9.5500000
## 222 text222 17.2831579
## 223 text223 12.6271429
## 224 text224 13.9523810
## 225 text225 14.1422222
## 226 text226 7.7733333
## 227 text227 11.1428571
## 228 text228 7.0478571
## 229 text229 10.7233333
## 230 text230 10.6900000
## 231 text231 6.2150000
## 232 text232 12.8600000
## 233 text233 14.7989474
## 234 text234 6.7050000
## 235 text235 8.1716667
## 236 text236 7.3685714
## 237 text237 6.9363636
## 238 text238 9.4167391
## 239 text239 11.9100000
## 240 text240 9.2954762
## 241 text241 12.8600000
## 242 text242 10.5809524
## 243 text243 6.4000000
## 244 text244 8.1400000
## 245 text245 11.6800000
## 246 text246 8.3833333
## 247 text247 8.5415385
## 248 text248 12.3147368
## 249 text249 11.2646154
## 250 text250 6.3384783
## 251 text251 10.1545455
## 252 text252 12.2700000
## 253 text253 12.3147368
## 254 text254 13.0400000
## 255 text255 12.9450000
## 256 text256 5.3272727
## 257 text257 9.6181818
## 258 text258 14.4445455
## 259 text259 11.7627273
## 260 text260 10.8234783
## 261 text261 8.1716667
## 262 text262 17.4164706
## 263 text263 7.3685714
## 264 text264 10.2080000
## 265 text265 3.6533333
## 266 text266 11.1520000
## 267 text267 8.7300000
## 268 text268 13.3904762
## 269 text269 4.0030769
## 270 text270 11.4700000
## 271 text271 13.9523810
## 272 text272 6.6000000
## 273 text273 7.4425000
## 274 text274 6.7900000
## 275 text275 5.4200000
## 276 text276 6.7900000
## 277 text277 8.9600000
## 278 text278 7.3676316
## 279 text279 10.4346154
## 280 text280 7.1633333
## 281 text281 6.3384783
## 282 text282 2.4616848
## 283 text283 9.0818182
## 284 text284 7.3818421
## 285 text285 10.1536364
## 286 text286 8.8977778
## 287 text287 13.6733333
## 288 text288 8.1716667
## 289 text289 11.1520000
## 290 text290 16.6621053
## 291 text291 6.4000000
## 292 text292 10.7133333
## 293 text293 10.4866667
## 294 text294 10.9783333
## 295 text295 6.3105882
## 296 text296 10.3104348
## 297 text297 3.9971429
## 298 text298 10.0985714
## 299 text299 15.0761905
## 300 text300 11.0966667
## 301 text301 13.4500000
## 302 text302 11.7657143
## 303 text303 12.8285714
## 304 text304 5.8184615
## 305 text305 5.8522222
## 306 text306 12.2700000
## 307 text307 4.2623684
## 308 text308 14.3128571
## 309 text309 10.3104348
## 310 text310 11.0966667
## 311 text311 8.7335714
## 312 text312 3.7550000
## 313 text313 8.2114286
## 314 text314 12.7750000
## 315 text315 7.3645652
## 316 text316 11.5828571
## 317 text317 6.6200000
## 318 text318 10.4835526
## 319 text319 5.3621429
## 320 text320 15.0300000
## 321 text321 7.0047059
## 322 text322 14.1778947
## 323 text323 8.0090909
## 324 text324 4.9644444
## 325 text325 10.9811905
## 326 text326 12.2666667
## 327 text327 14.0400000
## 328 text328 14.9278261
## 329 text329 16.6621053
## 330 text330 4.9644444
## 331 text331 8.2582609
## 332 text332 11.0726316
## 333 text333 7.6988235
## 334 text334 12.5576471
## 335 text335 8.0090909
## 336 text336 1.3133333
## 337 text337 6.0100000
## 338 text338 6.3384783
## 339 text339 11.7627273
## 340 text340 11.8495652
## 341 text341 6.7050000
## 342 text342 8.1716667
## 343 text343 6.7252632
## 344 text344 12.2866667
## 345 text345 8.5415385
## 346 text346 10.3569231
## 347 text347 6.0100000
## 348 text348 4.3316667
## 349 text349 15.4033333
## 350 text350 12.8285714
## 351 text351 5.8636364
## 352 text352 7.5866667
## 353 text353 8.1400000
## 354 text354 3.9170000
## 355 text355 5.0779412
## 356 text356 7.8544118
## 357 text357 12.0191176
## 358 text358 13.3886957
## 359 text359 11.5000000
## 360 text360 8.1716667
## 361 text361 13.0800000
## 362 text362 12.9450000
## 363 text363 8.7566667
## 364 text364 9.7973913
## 365 text365 11.3250000
## 366 text366 14.9875000
## 367 text367 11.9616667
## 368 text368 9.5500000
## 369 text369 8.5454545
## 370 text370 8.0090909
## 371 text371 9.8250000
## 372 text372 3.0175000
## 373 text373 5.4333333
## 374 text374 5.2466667
## 375 text375 2.2800000
## 376 text376 9.2954762
## 377 text377 8.9600000
## 378 text378 9.0730769
## 379 text379 8.9036957
## 380 text380 10.6900000
## 381 text381 11.4700000
## 382 text382 8.7566667
## 383 text383 4.9644444
## 384 text384 12.9357895
## 385 text385 6.4000000
## 386 text386 6.2770000
## 387 text387 12.3626087
## 388 text388 3.6601852
## 389 text389 7.4727273
## 390 text390 11.1520000
## 391 text391 9.0542857
## 392 text392 11.7047619
## 393 text393 4.0763636
## 394 text394 6.5790909
## 395 text395 5.5238095
## 396 text396 12.8354545
## 397 text397 11.5000000
## 398 text398 7.1900000
## 399 text399 13.7188889
## 400 text400 3.6700000
## 401 text401 3.6413158
## 402 text402 15.4333333
## 403 text403 10.2088889
## 404 text404 9.6550000
## 405 text405 14.0400000
## 406 text406 17.4164706
## 407 text407 6.8515217
## 408 text408 6.7271429
## 409 text409 12.5680000
## 410 text410 6.8750000
## 411 text411 7.6600000
## 412 text412 13.3886957
## 413 text413 1.7888158
## 414 text414 11.3200000
## 415 text415 7.1900000
## 416 text416 11.0900000
## 417 text417 5.9022222
## 418 text418 7.6000000
## 419 text419 10.8644444
## 420 text420 16.3411111
## 421 text421 13.2165517
## 422 text422 16.0282353
## 423 text423 15.8030769
## 424 text424 8.0090909
## 425 text425 16.9966667
## 426 text426 8.9175000
## 427 text427 6.2723077
## 428 text428 12.8600000
## 429 text429 8.1716667
## 430 text430 9.7400000
## 431 text431 14.2700000
## 432 text432 10.9783333
## 433 text433 13.5568421
## 434 text434 5.2300000
## 435 text435 6.4000000
## 436 text436 13.5120000
## 437 text437 7.6600000
## 438 text438 12.3147368
## 439 text439 8.3833333
## 440 text440 11.1694118
## 441 text441 13.9458824
## 442 text442 10.1536364
## 443 text443 13.7188889
## 444 text444 9.2426471
## 445 text445 6.4859524
## 446 text446 13.3904762
## 447 text447 11.7047619
## 448 text448 14.6300000
## 449 text449 7.0478571
## 450 text450 13.9523810
## 451 text451 10.2088889
## 452 text452 11.9100000
## 453 text453 8.4128571
## 454 text454 10.4866667
## 455 text455 6.2983333
## 456 text456 10.4866667
## 457 text457 10.6900000
## 458 text458 12.3147368
## 459 text459 7.9886842
## 460 text460 10.1536364
## 461 text461 10.0190476
## 462 text462 -0.2834259
## 463 text463 10.0190476
## 464 text464 5.9675000
## 465 text465 9.9600000
## 466 text466 8.2792647
## 467 text467 8.0090909
## 468 text468 11.0900000
## 469 text469 7.8991667
## 470 text470 6.3500000
## 471 text471 10.5809524
## 472 text472 4.8233333
## 473 text473 6.4000000
## 474 text474 6.3105882
## 475 text475 9.0542857
## 476 text476 10.6900000
## 477 text477 4.7909091
## 478 text478 12.2666667
## 479 text479 9.5500000
## 480 text480 5.8636364
## 481 text481 11.7047619
## 482 text482 17.7933333
## 483 text483 9.7973913
## 484 text484 12.8600000
## 485 text485 4.0000000
## 486 text486 10.4752941
## 487 text487 14.5142857
## 488 text488 11.7047619
## 489 text489 5.3621429
## 490 text490 14.2700000
## 491 text491 6.6000000
## 492 text492 12.2666667
## 493 text493 11.0726316
## 494 text494 12.8285714
## 495 text495 9.2954762
## 496 text496 14.3744444
## 497 text497 3.7080000
## 498 text498 8.3833333
## 499 text499 6.9363636
## 500 text500 5.3565217
## 501 text501 9.0116667
## 502 text502 10.8884615
## 503 text503 3.6484211
## 504 text504 6.1255263
## 505 text505 9.1400000
## 506 text506 8.3533333
## 507 text507 5.6200000
## 508 text508 10.3500000
## 509 text509 12.4533333
## 510 text510 8.9600000
## 511 text511 16.6233333
## 512 text512 11.0726316
## 513 text513 18.8047059
## 514 text514 6.2983333
## 515 text515 9.8589474
## 516 text516 7.8776087
## 517 text517 14.3744444
## 518 text518 9.0818182
## 519 text519 5.8636364
## 520 text520 8.0090909
## 521 text521 8.1716667
## 522 text522 15.3341176
## 523 text523 8.4128571
## 524 text524 12.5576471
## 525 text525 1.1642105
## 526 text526 13.9523810
## 527 text527 7.1154545
## 528 text528 5.2300000
## 529 text529 7.6338462
## 530 text530 7.6338462
## 531 text531 5.3565217
## 532 text532 8.9600000
## 533 text533 12.8285714
## 534 text534 10.7233333
## 535 text535 8.2422222
## 536 text536 15.4333333
## 537 text537 15.7250000
## 538 text538 12.4533333
## 539 text539 8.1800000
## 540 text540 12.8354545
## 541 text541 6.2755556
## 542 text542 9.9100000
## 543 text543 4.8233333
## 544 text544 5.3330000
## 545 text545 13.4500000
## 546 text546 19.1685714
## 547 text547 7.0047059
## 548 text548 8.0081818
## 549 text549 9.0818182
## 550 text550 9.0818182
## 551 text551 11.8675000
## 552 text552 14.6300000
## 553 text553 11.6936842
## 554 text554 10.2088889
## 555 text555 10.3500000
## 556 text556 19.9800000
## 557 text557 13.5568421
## 558 text558 8.9600000
## 559 text559 13.5568421
## 560 text560 -3.4000000
## 561 text561 2.8245455
## 562 text562 8.0097727
## 563 text563 13.3718182
## 564 text564 11.0966667
## 565 text565 7.4727273
## 566 text566 12.8600000
## 567 text567 6.3105882
## 568 text568 8.3929412
## 569 text569 8.3929412
## 570 text570 8.3929412
## 571 text571 10.4752941
## 572 text572 11.1428571
## 573 text573 11.3365217
## 574 text574 4.8400000
## 575 text575 14.0400000
## 576 text576 12.3000000
## 577 text577 7.4727273
## 578 text578 8.3533333
## 579 text579 7.6125000
## 580 text580 11.7047619
## 581 text581 6.0065385
## 582 text582 7.1633333
## 583 text583 5.8254348
## 584 text584 14.0400000
## 585 text585 8.3700000
## 586 text586 0.5166667
## 587 text587 5.3272727
## 588 text588 9.4492308
## 589 text589 13.3718182
## 590 text590 10.0190476
## 591 text591 7.8188889
## 592 text592 11.3365217
## 593 text593 6.8515217
## 594 text594 3.8972727
## 595 text595 7.7733333
## 596 text596 12.8354545
## 597 text597 14.7989474
## 598 text598 11.3442857
## 599 text599 16.5900000
## 600 text600 13.0733333
## 601 text601 11.8635294
## 602 text602 7.3747368
## 603 text603 8.6150000
## 604 text604 6.8750000
## 605 text605 11.3000000
## 606 text606 10.6900000
## 607 text607 10.6900000
## 608 text608 14.7989474
## 609 text609 10.1545455
## 610 text610 13.7188889
## 611 text611 11.1428571
## 612 text612 12.0087500
## 613 text613 3.7181818
## 614 text614 8.4744444
## 615 text615 13.0800000
## 616 text616 7.2816667
## 617 text617 14.7989474
## 618 text618 9.0870588
## 619 text619 11.0966667
## 620 text620 10.9414286
## 621 text621 11.1520000
## 622 text622 6.0065385
## 623 text623 12.5680000
## 624 text624 11.3365217
## 625 text625 6.7536842
## 626 text626 13.4500000
## 627 text627 4.9644444
## 628 text628 8.0090909
## 629 text629 12.9450000
## 630 text630 10.3500000
## 631 text631 5.8636364
## 632 text632 6.7536842
## 633 text633 9.2954762
## 634 text634 5.8636364
## 635 text635 11.5000000
## 636 text636 15.9985714
## 637 text637 18.6750000
## 638 text638 9.7811765
## 639 text639 9.4492308
## 640 text640 9.8091176
## 641 text641 7.4727273
## 642 text642 8.3333333
## 643 text643 11.0900000
## 644 text644 6.9363636
## 645 text645 6.0100000
## 646 text646 13.3904762
## 647 text647 12.8285714
## 648 text648 9.8573810
## 649 text649 15.0300000
## 650 text650 6.8515217
## 651 text651 11.0900000
## 652 text652 9.8250000
## 653 text653 11.5000000
## 654 text654 9.6550000
## 655 text655 4.9700000
## 656 text656 10.7300000
## 657 text657 8.7566667
## 658 text658 11.9100000
## 659 text659 9.5533333
## 660 text660 5.8636364
## 661 text661 9.5533333
## 662 text662 9.5700000
## 663 text663 14.1014706
## 664 text664 6.2133333
## 665 text665 12.2700000
## 666 text666 9.7811765
## 667 text667 6.0100000
## 668 text668 9.8305263
## 669 text669 7.1900000
## 670 text670 11.7047619
## 671 text671 10.9783333
## 672 text672 6.3384783
## 673 text673 7.5866667
## 674 text674 7.1900000
## 675 text675 15.6380952
## 676 text676 14.0400000
## 677 text677 9.6172727
## 678 text678 5.9240476
## 679 text679 11.1520000
## 680 text680 12.8285714
## 681 text681 4.4200000
## 682 text682 5.3621429
## 683 text683 11.5000000
## 684 text684 6.7050000
## 685 text685 7.7800000
## 686 text686 11.7627273
## 687 text687 -2.8800000
## 688 text688 1.7888158
## 689 text689 13.3886957
## 690 text690 9.6172727
## 691 text691 12.1723077
## 692 text692 11.3365217
## 693 text693 5.0779412
## 694 text694 11.7627273
## 695 text695 11.6800000
## 696 text696 4.9107692
## 697 text697 7.7800000
## 698 text698 11.2263636
## 699 text699 14.4445455
## 700 text700 9.2557143
## 701 text701 13.3718182
## 702 text702 14.7989474
## 703 text703 8.3700000
## 704 text704 9.0818182
## 705 text705 11.0900000
## 706 text706 10.5000000
## 707 text707 16.0410526
## 708 text708 7.1900000
## 709 text709 4.8987273
## 710 text710 12.8600000
## 711 text711 8.9600000
## 712 text712 6.2770000
## 713 text713 13.9523810
## 714 text714 4.7132759
## 715 text715 3.3200000
## 716 text716 13.6115385
## 717 text717 6.0100000
## 718 text718 10.7133333
## 719 text719 9.1300000
## 720 text720 12.2666667
## 721 text721 6.0427273
## 722 text722 8.9600000
## 723 text723 14.5142857
## 724 text724 14.1422222
## 725 text725 9.0875000
## 726 text726 8.4744444
## 727 text727 12.4533333
## 728 text728 7.4727273
## 729 text729 9.2954762
## 730 text730 10.7751724
## 731 text731 7.6000000
## 732 text732 16.3411111
## 733 text733 11.5000000
## 734 text734 8.3533333
## 735 text735 6.2755556
## 736 text736 9.1300000
## 737 text737 4.6520000
## 738 text738 7.7800000
## 739 text739 15.6855556
## 740 text740 8.9600000
## 741 text741 7.6988235
## 742 text742 11.7627273
## 743 text743 16.0410526
## 744 text744 10.5625000
## 745 text745 4.2863043
## 746 text746 5.8842857
## 747 text747 5.8184615
## 748 text748 12.8600000
## 749 text749 19.4988235
## 750 text750 7.6600000
## 751 text751 13.4366667
## 752 text752 4.3838235
## 753 text753 6.1255263
## 754 text754 10.2362963
## 755 text755 4.8987273
## 756 text756 15.6855556
## 757 text757 10.8234783
## 758 text758 4.9107692
## 759 text759 18.9900000
## 760 text760 10.3569231
## 761 text761 7.4727273
## 762 text762 12.8354545
## 763 text763 6.1375000
## 764 text764 15.0761905
## 765 text765 1.0822609
## 766 text766 13.9523810
## 767 text767 8.3333333
## 768 text768 4.7909091
## 769 text769 13.0733333
## 770 text770 2.8800000
## 771 text771 11.2263636
## 772 text772 0.5200000
## 773 text773 7.9673684
## 774 text774 10.4866667
## 775 text775 12.8600000
## 776 text776 4.4500000
## 777 text777 11.0900000
## 778 text778 13.9876923
## 779 text779 10.9783333
## 780 text780 6.0100000
## 781 text781 8.5415385
## 782 text782 13.6905882
## 783 text783 7.6125000
## 784 text784 6.7252632
## 785 text785 12.3147368
## 786 text786 -1.8400000
## 787 text787 13.9081818
## 788 text788 9.0542857
## 789 text789 6.4000000
## 790 text790 7.6097619
## 791 text791 6.9311111
## 792 text792 9.8660526
## 793 text793 10.5625000
## 794 text794 6.7900000
## 795 text795 11.2263636
## 796 text796 11.0900000
## 797 text797 6.7050000
## 798 text798 11.4700000
## 799 text799 8.5445455
## 800 text800 6.4661765
## 801 text801 15.5550000
## 802 text802 8.2650000
## 803 text803 9.1800000
## 804 text804 9.5500000
## 805 text805 8.5454545
## 806 text806 3.2207692
## 807 text807 14.1422222
## 808 text808 8.7713043
## 809 text809 5.8184615
## 810 text810 10.8884615
## 811 text811 14.2500000
## 812 text812 5.3330000
## 813 text813 6.8515217
## 814 text814 6.7800000
## 815 text815 8.0090909
## 816 text816 11.3365217
## 817 text817 16.3411111
## 818 text818 5.8636364
## 819 text819 8.1716667
## 820 text820 14.2700000
## 821 text821 13.3731818
## 822 text822 10.1536364
## 823 text823 11.7627273
## 824 text824 11.0557353
## 825 text825 11.0900000
## 826 text826 9.8250000
## 827 text827 15.8100000
## 828 text828 12.0375000
## 829 text829 10.4752941
## 830 text830 11.1520000
## 831 text831 14.0400000
## 832 text832 11.7047619
## 833 text833 7.1633333
## 834 text834 11.3000000
## 835 text835 9.6172727
## 836 text836 8.0090909
## 837 text837 13.4700000
## 838 text838 8.3906522
## 839 text839 3.7186364
## 840 text840 5.8066667
## 841 text841 9.5500000
## 842 text842 13.3886957
## 843 text843 11.2272727
## 844 text844 6.2755556
## 845 text845 11.6800000
## 846 text846 10.6308824
## 847 text847 12.8285714
## 848 text848 7.6097619
## 849 text849 13.5568421
## 850 text850 11.2646154
## 851 text851 13.4866667
## 852 text852 12.3147368
## 853 text853 12.7750000
## 854 text854 4.6200000
## 855 text855 8.7713043
## 856 text856 11.9616667
## 857 text857 9.0818182
## 858 text858 14.5142857
## 859 text859 12.3626087
## 860 text860 4.7909091
## 861 text861 9.0822727
## 862 text862 11.1428571
## 863 text863 13.5568421
## 864 text864 7.7714286
## 865 text865 11.0966667
## 866 text866 5.8254348
## 867 text867 8.5454545
## 868 text868 6.0692308
## 869 text869 14.7989474
## 870 text870 2.8900000
## 871 text871 16.0282353
## 872 text872 11.7627273
## 873 text873 9.0875000
## 874 text874 15.9911538
## 875 text875 4.3316667
## 876 text876 8.3700000
## 877 text877 11.8635294
## 878 text878 20.5900000
## 879 text879 10.1536364
## 880 text880 3.3609091
## 881 text881 14.6800000
## 882 text882 14.6300000
## 883 text883 10.5000000
## 884 text884 9.6550000
## 885 text885 2.3422222
## 886 text886 7.3645652
## 887 text887 6.3384783
## 888 text888 6.4661765
## 889 text889 11.4700000
## 890 text890 10.6800000
## 891 text891 4.4925000
## 892 text892 14.1778947
## 893 text893 8.7335714
## 894 text894 12.2666667
## 895 text895 12.8311111
## 896 text896 8.1716667
## 897 text897 11.8495652
## 898 text898 5.0988462
## 899 text899 13.9081818
## 900 text900 4.7993478
## 901 text901 3.6500000
## 902 text902 10.4411111
## 903 text903 5.3277273
## 904 text904 11.3365217
## 905 text905 4.7909091
## 906 text906 7.7800000
## 907 text907 4.8400000
## 908 text908 11.6240000
## 909 text909 10.0900000
## 910 text910 9.0875000
## 911 text911 7.5500000
## 912 text912 5.8400000
## 913 text913 6.4000000
## 914 text914 5.8636364
## 915 text915 3.6413158
## 916 text916 11.1694118
## 917 text917 10.3500000
## 918 text918 5.2300000
## 919 text919 10.1400000
## 920 text920 6.2770000
## 921 text921 8.0283333
## 922 text922 3.7550000
## 923 text923 10.0985714
## 924 text924 9.6181818
## 925 text925 5.3123913
## 926 text926 15.6855556
## 927 text927 7.8188889
## 928 text928 3.8173913
## 929 text929 7.7733333
## 930 text930 6.5790909
## 931 text931 14.3128571
## 932 text932 13.3904762
## 933 text933 9.9100000
## 934 text934 5.3272727
## 935 text935 9.7650000
## 936 text936 16.6300000
## 937 text937 17.6522222
## 938 text938 9.9100000
## 939 text939 11.5474074
## 940 text940 12.4077778
## 941 text941 14.4445455
## 942 text942 10.1545455
## 943 text943 7.6000000
## 944 text944 14.6300000
## 945 text945 13.9876923
## 946 text946 6.9363636
## 947 text947 12.8600000
## 948 text948 10.1400000
## 949 text949 8.1400000
## 950 text950 8.3698077
## 951 text951 10.4346154
## 952 text952 7.9957895
## 953 text953 4.9107692
## 954 text954 6.0100000
## 955 text955 14.9875000
## 956 text956 3.6601852
## 957 text957 12.8354545
## 958 text958 9.1400000
## 959 text959 8.8977778
## 960 text960 11.4700000
## 961 text961 6.2755556
## 962 text962 12.8600000
## 963 text963 17.6636364
## 964 text964 9.0818182
## 965 text965 7.7733333
## 966 text966 5.8254348
## 967 text967 14.6466667
## 968 text968 11.8635294
## 969 text969 7.6338462
## 970 text970 13.2950000
## 971 text971 11.4700000
## 972 text972 2.4833333
## 973 text973 6.1375000
## 974 text974 9.0875000
## 975 text975 4.8434783
## 976 text976 4.5205556
## 977 text977 4.8434783
## 978 text978 6.4004545
## 979 text979 10.4728947
## 980 text980 4.8300000
## 981 text981 7.3463158
## 982 text982 4.9107692
## 983 text983 5.8066667
## 984 text984 11.0900000
## 985 text985 11.7842857
## 986 text986 7.1966667
## 987 text987 6.6860714
## 988 text988 10.8234783
## 989 text989 8.0090909
## 990 text990 13.9017391
## 991 text991 14.7989474
## 992 text992 2.1876923
## 993 text993 11.7627273
## 994 text994 6.9142308
## 995 text995 9.5500000
## 996 text996 10.0190476
## 997 text997 4.2545455
## 998 text998 6.2150000
## 999 text999 9.7855556
## 1000 text1000 9.5033333
## merge
topFRE <- cbind(top_1000, FRE_top)
## plot
plot_FRE <- ggplot(topFRE, aes(x = pledged, y=Flesch.Kincaid, color = pledged, size = pledged)) +
geom_point(alpha = 0.5) +
geom_smooth(method='lm', color = 'red', se = F) +
ggthemes::theme_tufte() +
labs(
x = "Amount Pledged",
y = "Flesch Kincaid Readability Score",
title = "Relationship Between Amount Pledged and Readability of Text for Successful Projects")
ggplotly(plot_FRE)
## `geom_smooth()` using formula 'y ~ x'
In measuring project success by amount pledged, we can see from the plot that there is not much of a clear relationship between readability score and amount pledged. However, the data point that has the largest amount pledged does have a higher readability score.
Now, let’s check whether the use of positive / negative words or specific emotions helps a project to be successful.
Calculate the tone of each text based on the positive and negative words that are being used. You can rely on the Hu & Liu dictionary provided in lecture or use the Bing dictionary contained in the tidytext package (tidytext::sentiments). Visualize the relationship between tone of the document and success. Briefly comment.
## load sentiment dictionary
pos <- read.table('/Users/nikkigerjarusak/Documents/github/course_content/Lectures/Week09/data/dictionaries/positive-words.txt', as.is=T)
neg <- read.table('/Users/nikkigerjarusak//Documents/github/course_content/Lectures/Week09/data/dictionaries/negative-words.txt', as.is=T)
sentiment <- function(words=c("really great good stuff bad")){
require(quanteda)
tok <- quanteda::tokens(words)
pos.count <- sum(tok[[1]]%in%pos[,1])
neg.count <- sum(tok[[1]]%in%neg[,1])
out <- (pos.count - neg.count)/(pos.count+neg.count)
}
set.seed(1234)
## get random sample of 2000
random_df <- sample_n(df, 2000)
## clean random sample
random_df$blurb <- gsub("&", " ", random_df$blurb)
random_df$blurb <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", " ", random_df$blurb)
random_df$blurb <- gsub("@\\w+", " ", random_df$blurb)
random_df$blurb <- gsub("[[:punct:]]", " ", random_df$blurb)
random_df$blurb <- gsub("[[:digit:]]", " ", random_df$blurb)
random_df$blurb <- gsub("http\\w+", " ", random_df$blurb)
random_df$blurb <- gsub("[ \t]{2,}", " ", random_df$blurb)
random_df$blurb <- gsub("ˆ\\s+|\\s+$", " ", random_df$blurb)
random_df$blurb <- gsub("\n", " ", random_df$blurb)
random_cleaned <- random_df$blurb
random_sent <- data.frame(mapply(sentiment, random_cleaned))
random_sent <- data.frame(random_sent)
random_sent[is.na(random_sent)] <- 0
colnames(random_sent) <- c('Sentiment')
random_sent <- random_sent %>%
mutate(rn = row_number())
random_df <- random_df %>%
mutate(rn = row_number())
random_data <- merge(random_sent, random_df, by= "rn")
## sentiment vs amount pledged plot
sent_plot <- ggplot(data = random_data, aes(x = Sentiment, y = pledged)) +
geom_smooth(method= "loess",color= "black") + labs(
x = "Sentiment Score",
y = "Amount Pledged",
title = "Text Sentiment Score vs. Amount Pledged")
sent_plot
## `geom_smooth()` using formula 'y ~ x'
Documents with higher sentiment scores (more positive) tend to get a larger amount pledged to the project. However, we can see a dip at the end of the graph as documents that are too positive don’t achieve a higher amount pledged. We can see that there is a sweet spot of a sentiment score betweem .5 and 1.0 that receive the largest amount pledged.
Segregate all 2,000 blurbs into positive and negative texts based on their polarity score calculated in step (a). Now, collapse the positive and negative texts into two larger documents. Create a document-term-matrix based on this collapsed set of two documents. Generate a comparison cloud showing the most-frequent positive and negative words.
set.seed(1234)
## random 2000 sample
random_sample <- sample_n(df, 2000)
random_text <- random_sample$blurb
## clean using tidytext
random_text <- tolower(random_text)
random_text <- removePunctuation(random_text)
random_text<- removeNumbers(random_text)
random_text <- stripWhitespace(random_text)
random_text <- removeWords(random_text, stopwords("en"))
random_text <- lemmatize_words(random_text)
## apply sentiment function
text_sentiment <- data.frame(mapply(sentiment, random_text))
## merge
final_sent <- cbind(random_sample, text_sentiment)
final_sentiment <- final_sent$blurb
final_sent_corp <- VCorpus(VectorSource(final_sentiment))
final_sent_corp <- tm_map(final_sent_corp,
content_transformer(function(x) iconv(x, to='UTF-8-MAC', sub='byte')))
final_sent_corp <- tm_map(final_sent_corp, PlainTextDocument)
final_sent_corp <- tm_map(final_sent_corp, removeWords, stopwords("english"))
final_sent_corp <- tm_map(final_sent_corp, removePunctuation)
final_sent_corp <- tm_map(final_sent_corp, removeNumbers)
final_sent_corp <- tm_map(final_sent_corp, content_transformer(tolower))
final_sent_dtm <- DocumentTermMatrix(final_sent_corp)
final_sent_dtm <- tidy(final_sent_dtm)
sent_clean <- final_sent_dtm %>%
mutate(clean_word = lemmatize_words(term))
bing <- get_sentiments('bing')
final_sent_clean <-merge(sent_clean, bing,by.x='clean_word',by.y='word')
words <- final_sent_clean %>%
group_by(clean_word)%>%
summarize(count=n(), sentiment=first(sentiment)) %>%
arrange(count)
## comparison word cloud
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
matrix<- acast(words, clean_word~sentiment, value.var='count', fill=0)
comparison.cloud(matrix, colors=c('blue', 'pink'))
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): romantic could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): success could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): wonderful could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): forsake could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): freeze could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): impractical
## could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): knock could not
## be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): miss could not
## be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): nightmare could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): numb could not
## be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): obstacle could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): outbreak could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): problem could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): resistance
## could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): revenge could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): shake could not
## be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): shock could not
## be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): smelly could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): smoke could not
## be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): split could not
## be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): threat could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): tragedy could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): trap could not
## be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): trick could not
## be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): unemployed
## could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): unknown could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): unleash could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): unravel could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): unstable could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): vengeance could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): weird could not
## be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): appreciate
## could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): convenient
## could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): delight could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): empathy could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): enhance could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): guarantee could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): illuminate
## could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): magical could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): passionate
## could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): patient could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): powerful could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): precious could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): premier could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): reclaim could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): attraction
## could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): authentic could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): captivate could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): comfortable
## could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): compassion
## could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): dynamic could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): fortune could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): gorgeous could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): hopeful could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): inspiration
## could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): inspirational
## could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): legendary could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): luxury could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): promise could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): proper could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): protect could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): proud could not
## be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): secure could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): smooth could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): soulful could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): spiritual could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): strong could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): stylish could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): survivor could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): thank could not
## be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): wonder could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): angry could not
## be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): daunt could not
## be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): misfit could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): mope could not
## be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): slowly could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): unable could
## not be fit on page. It will not be plotted.
## Warning in comparison.cloud(matrix, colors = c("blue", "pink")): unfaithful
## could not be fit on page. It will not be plotted.
citation: referred to code found here: https://gist.github.com/rer145/8f31af53a9a8339dddbef93fd10e86ce
Now, use the NRC Word-Emotion Association Lexicon in the tidytext package to identify a larger set of emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust). Again, visualize the relationship between the use of words from these categories and success. What is your finding?
nrc <- get_sentiments("nrc")
unique(nrc$sentiment)
## [1] "trust" "fear" "negative" "sadness" "anger"
## [6] "surprise" "positive" "disgust" "joy" "anticipation"
for (i in c("trust","fear","negative","sadness", "anger", "surprise","positive","disgust","joy","anticipation"))
{
assign(paste('nrc', '_', i, sep=""), dplyr::filter(nrc, sentiment==i))
}
## create NRC df
nrc_text <- data.frame(doc_id = topFRE$document, pledged = topFRE$pledged
, text= topFRE$blurb)
## assign emotions
for (i in 1:1000)
{
indiv_text = nrc_text$text[i]
tok <- quanteda::tokens(indiv_text)
for (x in c("trust","fear","negative","sadness", "anger", "surprise","positive","disgust","joy","anticipation"))
{
result = sum(tok[[1]]%in%unlist(subset(nrc, sentiment==x, select=word)))
nrc_text[i,x] = result
}
}
## create column with emotions
colnms=c("trust","fear","negative","sadness", "anger", "surprise","positive","disgust","joy","anticipation")
for (emolex in c("trust","fear","negative","sadness", "anger", "surprise","positive","disgust","joy","anticipation"))
{
print(ggplot2::ggplot(data = nrc_text, aes( x = nrc_text[,emolex], y = pledged)) +
geom_col() +
scale_y_continuous(labels = scales::comma) +
labs(
x = stringr::str_to_title(emolex),
y = "Amount Pledged",
title = paste(stringr::str_to_title(emolex), "Sentiment/Emotion")
) +
ggthemes::theme_tufte())
}
In the bar plots above, the data set only contains top 1000 successful projects. Projects with words that have positive association tend to have more pledged amounts, but like our sentiment analysis findings, blurbs that are too positive have diminishing returns. Emotions that are associated with positive sentiment such as trust, joy and anticipation all have more similar relationships with amount pledged (measure of success). Another important factor to note is how across all emotions/sentiments, the “0” value is dominant. This represents the limitations to using NRC Emolex dictionary, since it is also taking into consideration the corpuses that do not contain any matches from the dictionary.
## random_sent <- random_sent %>%
## mutate(rn = row_number())
## nrc_sent <- final_sent_dtm %>%
## inner_join(get_sentiments('nrc'), by='word')
## nrc_sent_n <- nrc_sent %>%
## group_by(sentiment)%>%
## tally %>%
## arrange(desc(n))
Please add the hash of your final commit in the feedback pull request to submit your homework. The homework is due on Monday, April 4 at 5pm.
If you do come across something online that provides part of the analysis / code etc., please no wholesale copying of other ideas. We are trying to evaluate your abilities to visualized data not the ability to do internet searches. Also, this is an individually assigned exercise – please keep your solution to yourself.